The DOCCI dataset contains 15,000 images, each paired with a long, detailed English description written by a human annotator. The images and descriptions are designed to probe key limitations of current text-to-image models, including spatial relationships, counting, text rendering, and world knowledge. Each description is written to distinguish its image from highly similar ones.